In response to the growing demand from Zillow for a housing market model that incorporates local context, we have been tasked with leveraging data from multiple sources, including Philadelphia Open Data, the U.S. Census Bureau, and the American Community Survey, to gather comprehensive information that enables us to develop an accurate and reliable model of the Philadelphia housing market.
This report outlines our approach and methods, and the importance of understanding the local context that we integrate with pre-existing data. We aim to provide greater context and a precise understanding of Philadelphia’s housing market for present and future use.
To develop an enhanced understanding of the housing market, we acquired data that looks beyond the house itself to the conditions and amenities that surround it, unpacking the deeper meaning behind “location”. Our first source was the U.S. Census Bureau, a starting point for geographic features for visual display, broad context on parcels and neighborhoods, and overall socioeconomic data.
Our next source, Social Explorer, builds on U.S. Census data to create a more user-friendly and comprehensive database, which allowed us to efficiently find neighborhood-level information to experiment with in our model. The site shaped our ideas about potential indicators of sale price beyond the house itself, namely the amenities surrounding it, and led us to investigate less conventional neighborhood attributes that can be summarized as quantitative measures. Taking all of these variables into account, we were able to predict the prices of homes throughout Philadelphia with substantial accuracy.
We downloaded the following variables from the Census: Population Estimate, Total Housing Units, Total Vacant Units, Median Household Income, Households with Public Assistance Income, Owner-Occupied Housing Units, White Homeowners, and Total Occupied Housing Units. These variables provide a basic overview of housing in Philadelphia but are not sufficient on their own to predict future sale prices.
#Read the Census API key from an environment variable rather than hard-coding it in the script
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)
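For reproducibility, a minimal sketch of how these tract-level variables might be pulled with `tidycensus` is below. The ACS variable codes are our assumptions and should be verified against the Census documentation before use.

```r
# Candidate ACS 5-year variable codes (assumptions - verify against the ACS tables)
acs_vars <- c(totalPop = "B01003_001",        # population estimate
              totalHU = "B25001_001",         # total housing units
              totalVacant = "B25002_003",     # vacant housing units
              medHHInc = "B19013_001",        # median household income
              HHAssistedInc = "B19057_002",   # households with public assistance income
              OwnerOccH = "B25003_002",       # owner-occupied housing units
              WhiteHomeowner = "B25003A_002", # owner-occupied units, white householder
              TotalOccH = "B25003_001")       # total occupied housing units

acsTractsPHL.2021 <- get_acs(geography = "tract",
                             variables = acs_vars,
                             year = 2021,
                             state = "PA",
                             county = "Philadelphia",
                             geometry = TRUE,
                             output = "wide") %>% # wide output appends E/M suffixes to names
  st_transform('EPSG:2272')
```

Wide output returns an estimate (`E`) and margin-of-error (`M`) column per variable, so a renaming step would be needed to match the column names used later in this report.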
OpenDataPhilly is an open-source website, officially sponsored by the City of Philadelphia, that provides a catalog of free data. The data provided by the City and other organizations allows us to collect a wide variety of information we can use to categorize, describe, and develop a profile for each neighborhood. We used this website to gather spatial layers on planning districts, redevelopment areas, restaurants, and trees.
planning_districts <- st_read("https://opendata.arcgis.com/datasets/0960ea0f38f44146bb562f2b212075aa_0.geojson")%>%
st_transform('EPSG:2272')
## Reading layer `Planning_Districts' from data source
## `https://opendata.arcgis.com/datasets/0960ea0f38f44146bb562f2b212075aa_0.geojson'
## using driver `GeoJSON'
## Simple feature collection with 18 features and 8 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## Geodetic CRS: WGS 84
redevelopment_areas <- st_read("https://data-phl.opendata.arcgis.com/datasets/80f2c71305f5493c8e0aab9137354844_0.geojson") %>%
dplyr::filter(TYPE == 'Redevelopment Area Plan and Blight Certification' & STATUS == 'Active') %>%
st_transform('EPSG:2272')
## Reading layer `Redevelopment_Certified_Areas' from data source
## `https://data-phl.opendata.arcgis.com/datasets/80f2c71305f5493c8e0aab9137354844_0.geojson'
## using driver `GeoJSON'
## Simple feature collection with 71 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -75.26513 ymin: 39.87596 xmax: -75.01541 ymax: 40.08466
## Geodetic CRS: WGS 84
restaurants <- st_read('https://opendata.arcgis.com/datasets/53b8a1c653a74c92b2de23a5d7bf04a0_0.geojson')%>%
st_transform('EPSG:2272') %>%
dplyr::select(GEOID10,TOTAL_RESTAURANTS)
## Reading layer `370eb703-5a4f-4abb-8920-727cef31373b2020329-1-rcimn.5n39o' from data source `https://opendata.arcgis.com/datasets/53b8a1c653a74c92b2de23a5d7bf04a0_0.geojson'
## using driver `GeoJSON'
## Simple feature collection with 1336 features and 16 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## Geodetic CRS: WGS 84
restaurants <- restaurants %>%
mutate(rest_per_sqmi = as.numeric(TOTAL_RESTAURANTS / (st_area(restaurants)/358700000)))
The data here is the sample of house records we were provided prior to building our model. The attributes included form a base model of price indicators that focus exclusively on the house’s own features: the data mentioned earlier captures external factors, while this data captures internal ones. The attributes we keep from the house data need to be quantifiable and relevant. Several fields were ranked or categorical, which required us to manually map them to values the model can calculate with. From the many candidate predictors, the table is then filtered down to only the most important indicators.
house_data <- st_read("https://raw.githubusercontent.com/mafichman/musa_5080_2023/main/Midterm/data/2023/studentData.geojson") %>%
dplyr::select("objectid","central_air","exterior_condition", "fireplaces", "garage_spaces", "interior_condition", "mailing_street", "location", "number_of_bathrooms", "number_of_bedrooms","number_of_rooms","number_stories","sale_price","sale_date","total_area","total_livable_area","year_built","toPredict") %>%
st_transform('EPSG:2272')
## Reading layer `studentData' from data source
## `https://raw.githubusercontent.com/mafichman/musa_5080_2023/main/Midterm/data/2023/studentData.geojson'
## using driver `GeoJSON'
## Simple feature collection with 23881 features and 75 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.27367 ymin: 39.89162 xmax: -74.9618 ymax: 40.1377
## Geodetic CRS: WGS 84
#Feature engineering
house_data <- house_data %>%
  mutate(
    #Recode exterior condition so that 4 is the best condition, grouping some categories together
    exterior_condition = as.numeric(recode(exterior_condition, '7'='1', '6'='2', '5'='3', '4'='3', '3'='4', '2'='4', '1'='4')),
    #Assume average condition if the value is NA
    exterior_condition = ifelse(is.na(exterior_condition), 2, exterior_condition),
    #Set blank values to 0; based on the metadata, a blank value means no fireplaces
    fireplaces = ifelse(is.na(fireplaces), 0, fireplaces),
    #Assume there is always at least one bathroom - properties with 0 bathrooms are assigned a value of 1
    number_of_bathrooms = ifelse(number_of_bathrooms == 0, 1, number_of_bathrooms),
    number_of_bathrooms = ifelse(is.na(number_of_bathrooms), 1, number_of_bathrooms),
    #Assume average condition if the value is NA
    interior_condition = ifelse(is.na(interior_condition), 4, interior_condition))
We look at the percent of students at public schools who scored proficient or advanced on the PSSA/Keystone tests, which cover Science, Math, and English Language Arts. For each school, we averaged students’ performance on the three sections to get a single number per school. We then found the nearest school to each home and assigned that school’s test score to the home.
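The `schools` layer used below is prepared upstream of this report; a minimal sketch of the averaging step, assuming hypothetical column names for the three subject scores:

```r
# Hypothetical column names for the percent of students scoring proficient or
# advanced in each PSSA/Keystone subject; replace with the actual PA DOE field names.
schools <- schools %>%
  mutate(mean_profadv = (pct_profadv_ela + pct_profadv_math + pct_profadv_science) / 3)
```

Each home is then assigned the score of its nearest school during the final merge via `st_join(..., join = st_nearest_feature)`.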
We download tree locations as points and calculate the number of trees per square mile in each census tract. Trees indicate aesthetic appeal, environmental benefits, and shade. We extracted this data from OpenDataPhilly and joined it to the census tracts.
#Get Data on trees in Philadelphia
trees <- st_read('https://opendata.arcgis.com/api/v3/datasets/5abe042f2927486891c049cf064338cb_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1')%>%
st_transform('EPSG:2272')
## Reading layer `PPR_Tree_Inventory_2022' from data source
## `https://opendata.arcgis.com/api/v3/datasets/5abe042f2927486891c049cf064338cb_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
## using driver `GeoJSON'
## Simple feature collection with 151164 features and 6 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.28272 ymin: 39.87498 xmax: -74.95878 ymax: 40.13746
## Geodetic CRS: WGS 84
#Calculate the number of trees per census tract and convert to trees per square mile to normalize
acsTractsPHL.2021 <- st_intersection(acsTractsPHL.2021, trees) %>% #Determine what census tract each tree is in using intersection
st_drop_geometry() %>%
group_by(GEOID) %>% summarize(tree_count = n()) %>% #Count the number of trees in each Census tract
right_join(acsTractsPHL.2021, by='GEOID') %>% #Join tree count to the census tract dataframe
st_sf() %>%
mutate(trees_per_mi = as.double(tree_count / (st_area(acsTractsPHL.2021)/358700000)))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
We calculate the distance from each home to the nearest urban corridor. For most houses, proximity to a corridor means proximity to amenities, lower transportation costs, and an overall higher quality of life, so in most of Philadelphia being close to an urban corridor is an amenity that raises home value. Suburban areas on the outskirts of the city are the exception: these car-dependent neighborhoods tend to prefer quiet, privacy, and distance from corridors.
corridors_url <- "https://opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson"
corridors <- st_read(corridors_url, quiet= TRUE) %>% st_transform('EPSG:2272')
nearest_fts <- st_nearest_feature(house_data,corridors)
house_data$dist_urb_corridor <- as.double(st_distance(house_data, corridors[nearest_fts,], by_element=TRUE))
We calculate the number of shootings within a half-mile and a quarter-mile radius of each home. Shootings can drive home values through their impact on the perception of crime in the area and buyers’ concerns about personal safety or property damage. This data was sourced directly from Carto, another open-data platform.
shootings_url <- "https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings&filename=shootings&format=geojson&skipfields=cartodb_id"
shootings <- st_read(shootings_url) %>% st_transform('EPSG:2272') %>%
mutate(date_=as.Date(date_, format = "%d-%m-%Y")) %>%
dplyr::filter(date_ > '2023-01-01') %>%
dplyr::select(location)
## Reading layer `OGRGeoJSON' from data source
## `https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings&filename=shootings&format=geojson&skipfields=cartodb_id'
## using driver `GeoJSON'
## replacing null geometries with empty geometries
## Simple feature collection with 15057 features and 21 fields (with 437 geometries empty)
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.27362 ymin: 39.88409 xmax: -74.95936 ymax: 40.13117
## Geodetic CRS: WGS 84
house_data <- st_intersection(shootings,st_buffer(house_data,2640)) %>%
st_drop_geometry() %>%
count(objectid) %>%
rename(shooting_halfmile = n) %>%
right_join(house_data, by='objectid') %>%
mutate(shooting_halfmile = ifelse(is.na(shooting_halfmile),0,shooting_halfmile)) %>%
st_sf()
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
house_data <- st_intersection(shootings,st_buffer(house_data,2640/2)) %>%
st_drop_geometry() %>%
count(objectid) %>%
rename(shooting_quartermile = n) %>%
right_join(house_data, by='objectid') %>%
mutate(shooting_quartermile = ifelse(is.na(shooting_quartermile),0,shooting_quartermile)) %>%
st_sf()
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
After gathering data from various sources, we now coalesce them into one final dataset for our prediction model.
HDJoin <- st_intersection(house_data, restaurants %>%
dplyr::select("TOTAL_RESTAURANTS", "rest_per_sqmi"))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
HDJoin <- st_intersection(HDJoin, planning_districts %>%
dplyr::select("DIST_NAME"))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
HDJoin <- st_join(HDJoin,schools %>% dplyr::select('mean_profadv'), join=st_nearest_feature)
HDJoinRDAs <- HDJoin[st_intersects(HDJoin, redevelopment_areas) %>% lengths > 0, ] %>%
mutate(DevelopmentArea = 1)
NotRDAs <- HDJoin[!st_intersects(HDJoin, redevelopment_areas) %>% lengths > 0, ] %>%
mutate(DevelopmentArea = 0)
HDBind <- rbind(NotRDAs, HDJoinRDAs)
HDFinal <- st_intersection(HDBind, acsTractsPHL.2021 %>%
dplyr::select(-NAME)) %>%
dplyr::filter(toPredict == "MODELLING")
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
C <- ggplot()+
  geom_sf(data=restaurants,aes(fill = q5(rest_per_sqmi)))
D <- ggplot()+
geom_sf(data=restaurants,aes(fill = q5(TOTAL_RESTAURANTS)))
The table of summary statistics gives us a baseline on the range of values and distribution of each variable in our dataset. With all these factors in hand, we can use these variables collectively to predict home prices and visualize how strongly individual variables relate to price.
HDFinal_nongeom <- HDFinal %>% st_drop_geometry()
HDFinal_nongeom <- HDFinal_nongeom %>%
dplyr::select_if(is.numeric) %>%
dplyr::select(-objectid, -totalPop, -year_built, -totalHU, -number_of_rooms, -TotalOccH)
stargazer(HDFinal_nongeom, type = 'text', title= "Table 1: Summary Statistics")
##
## Table 1: Summary Statistics
## ==========================================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------------------------
## shooting_quartermile 23,779 3.764 5.461 0 45
## shooting_halfmile 23,779 13.644 15.507 0 99
## exterior_condition 23,779 3.168 0.460 1 4
## fireplaces 23,779 0.048 0.289 0 5
## garage_spaces 23,358 0.357 0.706 0 72
## interior_condition 23,779 3.647 0.889 0 7
## number_of_bathrooms 23,779 1.215 0.517 1 8
## number_of_bedrooms 23,763 2.583 1.267 0 31
## number_stories 23,771 1.958 0.551 1 5
## sale_price 23,779 272,984.900 239,942.500 10,100 5,990,000
## total_area 23,705 1,828.545 3,823.931 0 226,512
## total_livable_area 23,779 1,333.585 549.196 0 15,246
## dist_urb_corridor 23,779 670.238 638.845 0.000 7,383.925
## TOTAL_RESTAURANTS 23,779 4.333 7.314 0 174
## rest_per_sqmi 23,779 1,040.975 1,783.240 0.000 37,464.860
## mean_profadv 23,779 28.770 17.352 4.285 92.333
## DevelopmentArea 23,779 0.156 0.363 0 1
## tree_count 23,779 379.244 246.684 40 1,832
## totalVacant 23,779 202.342 136.300 0 691
## medHHInc 23,779 59,797.020 28,650.770 11,955.000 210,322.000
## HHAssistedInc 23,779 513.927 358.376 0 2,048
## OwnerOccH 23,779 1,094.438 475.065 0 2,685
## WhiteHomeowner 23,779 880.528 698.197 0 2,706
## HHOccupiedRate 23,779 0.514 0.159 0.000 0.877
## WhiteHOrate 23,779 0.450 0.316 0.000 0.970
## AssistedIncRate 23,779 0.243 0.154 0.000 0.681
## trees_per_mi 23,779 27,030.210 28,708.110 1,069.279 215,615.900
## --------------------------------------------------------------------------
A correlation matrix compares variables against one another to see how strongly each pair is related. Each cell indicates whether there is a direct correlation, an inverse correlation, or little relationship between the two variables. We see strong correlations among many factors. We are interested in variables that show a meaningful relationship with sale price, specifically those with an absolute correlation of 0.3 or greater.
HDFinal_nongeom %>%
correlate() %>%
autoplot() +
geom_text(aes(label = round(r, digits = 2)), size = 3.5)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
# Independent Variable Correlations

The four graphs below show four key independent variables that broadly cover different aspects of a neighborhood’s influence on house prices. Assisted income reflects the homeowner’s socioeconomic status, producing an inverse relationship with price. Tree count and total livable area are directly correlated with a house’s desirability, since home size is considered a luxury and more trees often indicate better care for a neighborhood. The last variable we examine is the white homeowner rate: white homeownership can indicate a “perceived” higher price value because of racial segregation, wealth inequality, and potential gentrification.
## Variables: total_livable_area, tree_count, WhiteHomeowner, AssistedIncRate
HDFinal_nongeom %>%
dplyr::select(sale_price, total_livable_area, tree_count, WhiteHOrate, AssistedIncRate) %>%
filter(sale_price <= 1000000) %>%
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of continuous variables") +
plotTheme()
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
This map explores the variation in 2021 home prices across Philadelphia. The most expensive areas are located near Center City or in far outskirts such as Chestnut Hill. We notice a ring-like pattern: the most expensive areas are either in the core, where all the activity is, or farthest away, where there is the least interaction with the city.
HDFinal <- HDFinal %>%
mutate(sale_class = cut(sale_price, breaks = c(0, 250000, 500000, 750000, 1000000, max(HDFinal$sale_price, na.rm=TRUE))))
ggplot()+
geom_sf(data=planning_districts,fill='grey80',color='transparent')+
geom_sf(data=HDFinal, size=0.75,aes(colour = q5(sale_class)))+
geom_sf(data=planning_districts,fill='transparent',color='black')+
scale_color_manual(values = palette5,
name = "Sales Price (USD)",
na.value = 'grey80',
labels = c('$0-$250k', '$250k-$500k', '$500k-$750k', '$750k-$1m', '$1m+'))+
mapTheme()
# Interesting Map 1, Assisted Income
This map shows the rate of households relying on assisted income across the city. We observe that the central Northeast, in areas such as Kensington, Fairhill, and Feltonville, is most reliant on government assistance for income, and the surrounding areas may be affected by the poverty concentrated there, forming a perimeter around the “yellow” core.
ggplot()+
geom_sf(data=planning_districts,fill='grey80',color='transparent')+
geom_sf(data=HDFinal,aes(colour = (AssistedIncRate * 100)),size=0.5)+
scale_color_viridis(name = "Assisted Income Per 100 Households")+
geom_sf(data=planning_districts,fill='transparent',color='black')+
mapTheme()
# Interesting Map 2, School Performance
The school system in a neighborhood can be a deal-breaking factor for families seeking a home in an area. The Keystone Exam is the state’s index for evaluating school performance and the quality of students’ education.
#profadv is the percentage of students who received proficient OR advanced scores on the Keystone Exam, per the Pennsylvania DOE
ggplot()+
geom_sf(data=planning_districts,fill='grey80',color='transparent')+
geom_sf(data=schools,aes(colour = q5(mean_profadv)),size=2.5)+
scale_color_viridis_d(name = "Student Performance on Keystone Exam", labels = c('Poor', 'Below Average', 'Average', 'Above Average', 'Excellent'))+
geom_sf(data=planning_districts,fill='transparent',color='black')+
mapTheme()
# Interesting Map 3, Heat Map of Shootings
We mapped where all shootings have occurred and created a heat map showing the “hot spots” where shootings most often occur. As with the education ratings and government assistance, we notice crime is clustered just north of the city center, with residual spillover into the areas north of that cluster. Center City and the areas south of the largest hot spot appear spared from gun violence.
ggplot()+
geom_sf(data=planning_districts,fill='grey80',color='transparent')+
stat_density2d(data = data.frame(st_coordinates(shootings)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "white", high = "red", name = "Shootings")+
scale_alpha(guide = "none") +
labs(title = "Density of Reported Shootings") +
geom_sf(data=planning_districts,fill='transparent',color='black')+
mapTheme()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..level..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(level)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 15 rows containing non-finite values (`stat_density2d()`).
# Sale prices overlaid on active redevelopment areas
ggplot()+
geom_sf(data=planning_districts,fill='grey90',color='transparent')+
geom_sf(data=HDFinal,aes(colour = q5(sale_class)),size=0.5)+
geom_sf(data=redevelopment_areas,fill='transparent',color='black')+
scale_color_manual(values = palette5,
name = "Sales Price (USD)",
na.value = 'grey80',
labels = c('$0-$250k', '$250k-$500k', '$500k-$750k', '$750k-$1m', '$1m+'))+
mapTheme()
# “Something Engaging”: Interactive Map of Owner-Occupied Rate

Provided below is a map depicting where homeowners occupy the houses they reside in, as distinct from areas where people rent. If you hover over a tract, you’ll see the exact rate of home ownership in that area. Other tools are available at the top right of the map.
ggplotly(
ggplot(acsTractsPHL.2021)+
geom_sf(aes(fill = HHOccupiedRate))+
scale_fill_gradient(low = "black", high = "yellow", name = "Rate of Home Ownership")
)
We have decided on a filtered list of key predictor variables for home values: shootings within a half mile, interior condition, total livable area, planning district name, school performance, redevelopment area status, tree count, median household income, exterior condition, number of fireplaces, white homeownership rate, assisted income rate, restaurants per square mile, and number of bathrooms.
inTrain <- createDataPartition(
y = paste(HDFinal$DIST_NAME),
p = .60, list = FALSE)
Philly.training <- HDFinal[inTrain,]
Philly.test <- HDFinal[-inTrain,]
reg.training <-
lm(sale_price ~ ., data = as.data.frame(Philly.training) %>%
dplyr::select(sale_price, shooting_halfmile, interior_condition, total_livable_area,
DIST_NAME, mean_profadv,
DevelopmentArea, tree_count, medHHInc,
exterior_condition, fireplaces, WhiteHOrate, AssistedIncRate,
rest_per_sqmi, number_of_bathrooms))
summary(reg.training)
##
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(Philly.training) %>%
## dplyr::select(sale_price, shooting_halfmile, interior_condition,
## total_livable_area, DIST_NAME, mean_profadv, DevelopmentArea,
## tree_count, medHHInc, exterior_condition, fireplaces,
## WhiteHOrate, AssistedIncRate, rest_per_sqmi, number_of_bathrooms))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1646927 -54701 -716 50017 4217280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.852e+04 2.021e+04 3.886 0.000102 ***
## shooting_halfmile -1.088e+03 1.223e+02 -8.896 < 2e-16 ***
## interior_condition -3.034e+04 1.883e+03 -16.112 < 2e-16 ***
## total_livable_area 1.924e+02 2.642e+00 72.837 < 2e-16 ***
## DIST_NAMECentral Northeast -1.828e+05 8.302e+03 -22.020 < 2e-16 ***
## DIST_NAMELower Far Northeast -1.976e+05 8.180e+03 -24.159 < 2e-16 ***
## DIST_NAMELower North -1.671e+05 8.025e+03 -20.826 < 2e-16 ***
## DIST_NAMELower Northeast -1.821e+05 8.196e+03 -22.213 < 2e-16 ***
## DIST_NAMELower Northwest -2.247e+05 7.022e+03 -31.999 < 2e-16 ***
## DIST_NAMELower South -2.091e+05 2.479e+04 -8.437 < 2e-16 ***
## DIST_NAMELower Southwest -1.986e+05 1.035e+04 -19.189 < 2e-16 ***
## DIST_NAMENorth -1.995e+05 8.321e+03 -23.978 < 2e-16 ***
## DIST_NAMENorth Delaware -1.961e+05 7.461e+03 -26.288 < 2e-16 ***
## DIST_NAMERiver Wards -2.086e+05 6.805e+03 -30.653 < 2e-16 ***
## DIST_NAMESouth -1.539e+05 6.031e+03 -25.522 < 2e-16 ***
## DIST_NAMEUniversity Southwest -1.719e+05 9.871e+03 -17.415 < 2e-16 ***
## DIST_NAMEUpper Far Northeast -2.174e+05 9.030e+03 -24.078 < 2e-16 ***
## DIST_NAMEUpper North -1.917e+05 8.390e+03 -22.850 < 2e-16 ***
## DIST_NAMEUpper Northwest -2.051e+05 8.470e+03 -24.213 < 2e-16 ***
## DIST_NAMEWest -1.761e+05 8.663e+03 -20.330 < 2e-16 ***
## DIST_NAMEWest Park -2.003e+05 1.152e+04 -17.393 < 2e-16 ***
## mean_profadv 2.148e+03 1.042e+02 20.613 < 2e-16 ***
## DevelopmentArea 5.295e+03 3.969e+03 1.334 0.182218
## tree_count 3.629e+01 6.430e+00 5.643 1.70e-08 ***
## medHHInc 3.661e-01 8.327e-02 4.397 1.10e-05 ***
## exterior_condition 1.644e+04 3.399e+03 4.836 1.34e-06 ***
## fireplaces 7.413e+04 4.498e+03 16.480 < 2e-16 ***
## WhiteHOrate 6.230e+04 9.989e+03 6.237 4.60e-10 ***
## AssistedIncRate -4.526e+04 1.549e+04 -2.922 0.003486 **
## rest_per_sqmi 7.365e+00 7.966e-01 9.245 < 2e-16 ***
## number_of_bathrooms 4.696e+04 2.862e+03 16.405 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141200 on 14244 degrees of freedom
## Multiple R-squared: 0.6768, Adjusted R-squared: 0.6761
## F-statistic: 994.2 on 30 and 14244 DF, p-value: < 2.2e-16
In this step we train the model to predict housing prices. We have provided data points, as described above, each with its own set of values related to sale price; the model learns that relationship and produces its own predictions when given new data.
The table below shows how far the predictions are from the actual values, known as the absolute error. Because we are working with high-value commodities (i.e., housing), the amount the model is off by can be quite large, so it is equally important to look at the percentage error to see the relative miss.
Philly.test <-
Philly.test %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg.training, Philly.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price)%>%
filter(sale_price < 5000000)
Philly.test %>%
st_drop_geometry() %>%
summarise(MAE = mean(sale_price.AbsError),
MAPE = abs(mean(sale_price.APE)*100)) %>%
kbl(col.name=c('Mean Absolute Error','Mean Absolute Percentage Error')) %>%
kable_classic()
| Mean Absolute Error | Mean Absolute Percentage Error |
|---|---|
| 75709.16 | 35.79384 |
A nearest neighbor analysis is our way of identifying the distribution of points in a given space, whether by distance or by a specific value. For our analysis, we use nearest neighbor values to help predict sale prices, since neighboring homes tend to share attributes and can therefore fall in a similar price range.
coords <- st_coordinates(HDFinal)
neighborList <- knn2nb(knearneigh(coords, 5))
## Warning in knearneigh(coords, 5): knearneigh: identical points found
## Warning in knearneigh(coords, 5): knearneigh: kd_tree not available for
## identical points
spatialWeights <- nb2listw(neighborList, style="W")
HDFinal$lagPrice <- lag.listw(spatialWeights, HDFinal$sale_price)
coords.test <- st_coordinates(Philly.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
## Warning in knearneigh(coords.test, 5): knearneigh: identical points found
## Warning in knearneigh(coords.test, 5): knearneigh: kd_tree not available for
## identical points
spatialWeights.test <- nb2listw(neighborList.test, style="W")
Philly.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>%
ggplot()+
geom_point(aes(x =lagPriceError, y =sale_price.Error))
# Moran's I

Moran’s I is a statistic that measures the tendency of data points with similar values to occur near one another. Like the nearest neighbor analysis, it lets us statistically test whether houses close to each other have similar sale prices (here we test the model’s errors). This measurement helps us identify clusters of high and low values (i.e., different neighborhoods and their perceived value).
The gray histogram below shows Moran’s I values from 999 random permutations of the errors; the orange line marks the observed Moran’s I of the model’s prediction errors. The observed Moran’s I is roughly 0.2, which lies well outside the permuted distribution: the result is statistically significant and suggests the errors are spatially correlated.
moranTest <- moran.mc(Philly.test$sale_price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
# Table of Average Predictions

The table below compares the average prediction the model produced with the actual average price in each district. We see a reasonably strong level of accuracy between predicted and actual values. Keep in mind that these are averages, and the predicted price for any individual home in a district can vary widely.
Philly.test %>%
as.data.frame() %>%
group_by(DIST_NAME) %>%
summarize(meanPrediction = mean(sale_price.Predict),
meanPrice = mean(sale_price)) %>%
kable() %>%
kable_styling()
| DIST_NAME | meanPrediction | meanPrice |
|---|---|---|
| Central | 609764.50 | 579944.62 |
| Central Northeast | 287054.37 | 287035.44 |
| Lower Far Northeast | 286284.26 | 289399.58 |
| Lower North | 230382.84 | 229935.83 |
| Lower Northeast | 174741.18 | 173679.61 |
| Lower Northwest | 367666.82 | 370908.29 |
| Lower South | 457259.83 | 507786.96 |
| Lower Southwest | 130254.43 | 127275.47 |
| North | 93134.25 | 94969.27 |
| North Delaware | 211661.65 | 211547.08 |
| River Wards | 248720.35 | 257434.39 |
| South | 317909.04 | 319461.05 |
| University Southwest | 243080.49 | 246246.59 |
| Upper Far Northeast | 360528.93 | 353300.65 |
| Upper North | 179616.76 | 175380.92 |
| Upper Northwest | 335241.08 | 358161.42 |
| West | 160516.95 | 152537.70 |
| West Park | 234824.98 | 222589.28 |
Here we compare how including neighborhood effects changes the sale price predictions. As the graphs above show, there is a statistically significant relationship between a house’s sale price and the neighborhood it sits in. We create a second model to compare accuracy with and without the neighborhood effects.
reg.nhood <- lm(sale_price ~ ., data = as.data.frame(Philly.training) %>%
dplyr::select(sale_price, shooting_halfmile, interior_condition, total_livable_area,
DIST_NAME, mean_profadv,
DevelopmentArea, tree_count, medHHInc,
exterior_condition, fireplaces, WhiteHOrate, AssistedIncRate,
rest_per_sqmi, number_of_bathrooms))
Philly.test.nhood <-
Philly.test %>%
mutate(Regression = "Neighborhood Effects",
sale_price.Predict = predict(reg.nhood, Philly.test),
sale_price.Error = sale_price.Predict- sale_price,
sale_price.AbsError = abs(sale_price.Predict- sale_price),
sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)%>%
filter(sale_price < 5000000)
bothRegressions <-
rbind(
dplyr::select(Philly.test, starts_with("sale_price"), Regression, DIST_NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)),
dplyr::select(Philly.test.nhood, starts_with("sale_price"), Regression, DIST_NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)))
st_drop_geometry(bothRegressions) %>%
gather(Variable, Value, -Regression, -DIST_NAME) %>%
filter(Variable == "sale_price.AbsError" | Variable == "sale_price.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
kable()
| Regression | sale_price.AbsError | sale_price.APE |
|---|---|---|
| Baseline Regression | 75709.16 | 0.3579384 |
| Neighborhood Effects | 75709.16 | 0.4117570 |
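For clarity, the two columns above are mean absolute error (in dollars) and mean absolute percent error (the mean of `sale_price.APE`). A minimal sketch of how they are computed, using toy numbers rather than our model's output:

```r
# Hypothetical observed and predicted sale prices
observed  <- c(100000, 200000, 400000)
predicted <- c(110000, 180000, 440000)

# MAE: average dollar error, in the units of the outcome
mae  <- mean(abs(predicted - observed))

# MAPE: average error as a share of the observed price, scale-free,
# which makes cheap and expensive districts comparable
mape <- mean(abs(predicted - observed) / observed)

mae   # 23333.33
mape  # 0.1
```

Because MAPE divides by the observed price, the same dollar error weighs more heavily in low-value districts, a point that matters when reading the MAPE maps below.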
The graph below compares the overall accuracy of the two models against the observed prices used to train them. Despite the statistically significant neighborhood relationship, the neighborhood-effects model changes the predictions only minimally: mean absolute error is essentially unchanged and mean absolute percent error is slightly higher. One likely reason is that the baseline model already captured much of the neighborhood effect through its other spatially patterned predictors, so adding the district explicitly was largely reinforcement and could add only so much to the predictions.
bothRegressions %>%
dplyr::select(sale_price.Predict, sale_price, Regression) %>%
ggplot(aes(sale_price, sale_price.Predict)) +
geom_point() +
stat_smooth(aes(sale_price, sale_price),
method = "lm", se = FALSE, size = 1, colour="#FA7800") +
stat_smooth(aes(sale_price.Predict, sale_price),
method = "lm", se = FALSE, size = 1, colour="#25CB10") +
facet_wrap(~Regression) +
labs(title="Predicted sale price as a function of observed price",
subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
plotTheme()
# Spatial Pattern of Both Regressions

We continue to see a correlation between neighborhood and a home's sale price. The two maps below show this spatial autocorrelation by mapping how the neighborhood effect changes prediction error. Depicted by district, mean absolute percent error (MAPE) shows that the baseline model becomes less reliable in the north-central districts, precisely where housing values are lowest, as seen in prior figures.
st_drop_geometry(bothRegressions) %>%
group_by(Regression, DIST_NAME) %>%
summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(planning_districts) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = bothRegressions, colour = "black", size = .5) +
facet_wrap(~Regression) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood") +
mapTheme()
These two maps show the significant income gap between races in Philadelphia. There is an evident parallel between higher-income areas and smaller minority populations. Many historical, political, and socioeconomic forces created and continue to define these boundaries, and that is exactly why we did not recommend incorporating race as a factor in the models above.

Adding race to a model creates a cascading effect of discrimination: future predictions come to treat race as a predetermined factor that affects house values. Race correlates with housing outcomes only because historical systemic racism and discriminatory policies tied racial disparity to economic disparity.
grid.arrange(ncol = 2,
ggplot() +
geom_sf(data=planning_districts,fill='beige',color='transparent')+
geom_sf(data = na.omit(acsTractsPHL.2021), aes(fill = WhiteHOrate)) +
scale_fill_gradient(low = "black", high = "white", name = "Percentage of White Homeowners")+
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() +
geom_sf(data=planning_districts,fill='beige',color='transparent')+
geom_sf(data = na.omit(acsTractsPHL.2021), aes(fill = medHHInc)) +
scale_fill_gradient(low = "black", high = "green", name = "Median Household Income",
na.value = 'grey80',
# breaks pinned so the dollar labels below line up with them
breaks = seq(0, 200000, by = 50000),
labels = c('$0', '$50k', '$100k', '$150k', '$200k'))+
labs(title = "Income Context") +
mapTheme() + theme(legend.position="bottom"))